Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes

نویسندگان

  • Ludovic Hofer
  • Hugo Gimbert
چکیده

This paper presents a new method to learn online policies in continuous state, continuous action, model-free Markov decision processes, with two properties that are crucial for practical applications. First, the policies are implementable with a very low computational cost: once the policy is computed, the action corresponding to a given state is obtained in logarithmic time with respect to the number of samples used. Second, our method is versatile: it does not rely on any a priori knowledge of the structure of optimal policies. We build upon the Fitted Q-iteration algorithm which represents the Q-value as the average of several regression trees. Our algorithm, the Fitted Policy Forest algorithm (FPF), computes a regression forest representing the Q-value and transforms it into a single tree representing the policy, while keeping control on the size of the policy using resampling and leaf merging. We introduce an adaptation of Multi-Resolution Exploration (MRE) which is particularly suited to FPF. We assess the performance of FPF on three classical benchmarks for reinforcement learning: the ”Inverted Pendulum”, the ”Double Integrator” and ”Car on the Hill” and show that FPF equals or outperforms other algorithms, although these algorithms rely on the use of particular representations of the policies, especially chosen in order to fit each of the three problems. Finally, we exhibit that the combination of FPF and MRE allows to find nearly optimal solutions in problems where -greedy approaches would fail.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Bandit-Based Planning and Learning in Continuous-Action Markov Decision Processes

Recent research leverages results from the continuous-armed bandit literature to create a reinforcement-learning algorithm for continuous state and action spaces. Initially proposed in a theoretical setting, we provide the first examination of the empirical properties of the algorithm. Through experimentation, we demonstrate the effectiveness of this planning method when coupled with exploratio...

متن کامل

Utilizing Generalized Learning Automata for Finding Optimal Policies in MMDPs

Multi agent Markov decision processes (MMDPs), as the generalization of Markov decision processes to the multi agent case, have long been used for modeling multi agent system and are used as a suitable framework for Multi agent Reinforcement Learning. In this paper, a generalized learning automata based algorithm for finding optimal policies in MMDP is proposed. In the proposed algorithm, MMDP ...

متن کامل

Operation Scheduling of MGs Based on Deep Reinforcement Learning Algorithm

: In this paper, the operation scheduling of Microgrids (MGs), including Distributed Energy Resources (DERs) and Energy Storage Systems (ESSs), is proposed using a Deep Reinforcement Learning (DRL) based approach. Due to the dynamic characteristic of the problem, it firstly is formulated as a Markov Decision Process (MDP). Next, Deep Deterministic Policy Gradient (DDPG) algorithm is presented t...

متن کامل

Online exploration in least-squares policy iteration

One of the key problems in reinforcement learning is balancing exploration and exploitation. Another is learning and acting in large or even continuous Markov decision processes (MDPs), where compact function approximation has to be used. In this paper, we provide a practical solution to exploring large MDPs by integrating a powerful exploration technique, Rmax, into a state-of-the-art learning...

متن کامل

Convergent Reinforcement Learning for Hierarchical Reactive Plans

Hierarchical reinforcement learning techniques operate on structured plans. Although structured representations add expressive power to Markov Decision Processes (MDPs), current approaches impose constraints that force the associated convergence proofs to depend upon a subroutinestyle execution model that restricts adaptive response. We develop an alternate approach to convergent learning that ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1612.03780  شماره 

صفحات  -

تاریخ انتشار 2016